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Abstract 

Background: RNA-Seq is a technique that uses Next Generation Sequencing to identify transcripts and estimate 
transcription levels. When applying this technique for quantification, one must contend with reads that align to 
multiple positions in the genome (multireads). Previous efforts to resolve multireads have shown that RNA-Seq 
expression estimation can be improved using probabilistic allocation of reads to genes. These methods use a 
probabilistic generative model for data generation and resolve ambiguity using likelihood-based approaches. In 
many instances, RNA-seq experiments are performed in the context of a population. The generative models of 
current methods do not take into account such population information, and it is an open question whether this 
information can improve quantification of the individual samples 

Results: In order to explore the contribution of population level information in RNA-seq quantification, we apply a 
hierarchical probabilistic generative model, which assumes that expression levels of different individuals are 
sampled from a Dirichlet distribution with parameters specific to the population, and reads are sampled from the 
distribution of expression levels. We introduce an optimization procedure for the estimation of the model 
parameters, and use HapMap data and simulated data to demonstrate that the model yields a significant 
improvement in the accuracy of expression levels of paralogous genes. 

Conclusions: We provide a proof of principal of the benefit of drawing on population commonalities to estimate 
expression. The results of our experiments demonstrate this approach can be beneficial, primarily for estimation at 
the gene level. 



Introduction 

With the rapid decline in the cost of sequencing, RNA- 
Seq has emerged as a legitimate competitor to mi-croar- 
rays as a means of assessing global gene expression. 
Even as arrays currently enjoy a cost advantage, many 
new applications of information accessible only through 
sequencing further strengthen the case that sequencing 
may soon supplant arrays as the technology of choice 
for transcription analysis. One such application is fine- 
grained assessment of variation in expression and the 
sources for such variation, as exemplified by recent 
large-scale RNA-Seq studies [1,2] of two different 
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HapMap [3] populations. Such studies complement 
genomic DNA sequencing by elucidating the link 
between SNPs and expression. 

Unfortunately, with any new technology come its lim- 
itations, and several studies have pointed out that RNA- 
Seq is not immune to bias [4-6]. Perhaps the most 
widely discussed hurdle to accurate estimation in the 
case of RNA-Seq is the problem of reads mapped to 
multiple locations in the target genome (or in the target 
transcript sequences). These reads, which are called 
multireads, can stem from either paralogous gene 
sequences or from different isoforms of the same gene 
that share exons. 

Several methods have emerged to address the multi- 
read problem for paralog and isoform estimation [7-10]. 
These methods are all based on probabilistic modeling 
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that is optimized by an expectation maximization proce- 
dure. It has been repeatedly shown that by using such 
methods one can get better quantification of the expres- 
sion levels compared to quantification based on naive 
approaches of read assignment. 

In many applications, a set of samples is studied. For 
instance, one may be interested in comparing the 
expression levels in cases verses controls, or in tissues 
originating from different organs. In such cases, it is 
plausible that the commonality of expression patterns 
within each of the defined groups of studied samples 
may be used to improve the quantification results in 
each of the samples. We demonstrate that by analyzing 
expression profiles of a population together, one gets 
expression estimates more accurate than those obtained 
by estimating individual sample expression levels inde- 
pendently. We describe and implement a probabilistic 
model of the sequencing process that incorporates 
population commonalities, and demonstrate its advan- 
tages over existing methods in the population setting. 

Methods 

RNA-Seq multiread expression estimation 

As we seek to extend the prevalent generative model of 
RNA-Seq [7-11], we begin by reviewing the basic elements 
of that model. Let G - (G v ...,G M ) be the set of M tran- 
scribed regions considered and P - (Pi, ...,Pm) he the pro- 
portions of RNA bases attributed to each transcript out of 
the total number of transcribed bases in a sequenced sam- 
ple. Regions may be either genes or transcripts, depending 
on the level of transcription being investigated. We require 
P to satisfy L geG P g =l and VgcG, 0 < P g < 1. 

The model describes an RNA sequencing experiment 
where regions in G are randomly chosen according to 
the distribution P, start positions in these regions are 
chosen uniformly, and reads of length t are generated 
by copying t consecutive bases from each chosen region 
to produce a set of reads R - (ri,..., r p ). Sequencing is 
assumed to be error prone, leading to a certain prob- 
ability of error for each read base. Based on the repeti- 
tions present in the set of regions and errors in 
alignment, reads may fail to map to the region from 
which they originate or may map to additional locations. 
Thus, we assign a probability of obtaining read r ; given 
that it originated from region 

(I _ g ^-error jk ) g error jk 



G k ,P{rj\G k ) = 



4 



-. In this case we rely 



on the alignment of r ; to G k to afford us the best match 
position instead of summing over all possible starting 
positions, i k is the effective length of G k (i.e., the num- 
ber of start positions from which a full length read can 
be derived) as defined in [11], e is taken to be a constant 
per-base error rate, errors are assumed to be 



independent, and error ) k is the number of mismatches in 
the best alignment of r ; to G k . 

This formulation leads to the likelihood of observing 
the data: 

p p M 

L(P;R) = Y\P(rj\G,P) = f] 2>(G fe }P(r}|G fe ) (1) 



j=i fe=i 



This likelihood function is used to estimate P given 
the read alignments. Typically, one will use expectation 
maximization to find the P for which the likelihood is 
maximized. It is assumed that P(rj\G k ) is zero for all 
regions to which r ; is not aligned. 

Common population extension 

To estimate expression levels in N individuals from a 
defined population, we modify the above model by 
assuming that samples are drawn from a common popu- 
lation. This is imposed by having P = [(P n , ... t PiM)t 
(Pni ~<>Pn m)] be probability densities drawn from a 
common Dirichlet distribution, defined by a set of 
hyper-parameters specific to the population: V7e[l, A/], 
Pi = OPfi,..., P iM ) ~ Dirichlet (a l9 ..., a M ). 

For sample i, we denote the set of reads as 
Ri = ( r ii/ • • • > r ipi\ where each r t j is mapped to one or 
more regions in G. The output of a read alignment pro- 
gram defines the set of accepted regions for the read (in 
practice only alignments with up to 2 errors are accepted) 
and provides the number of errors in alignment for each 
read-region pair. This allows us to calculate P{rij\G k ) as 
done above for one sample. For convenience we denote P 
( r ij\G k ) - qtj k (taken to be zero for all regions not mapped 
to), which is independent of a and P. 

As before, our objective is to estimate P, but in this 
case we must optimize by estimating P and a together. 
We begin by writing the likelihood function: 

L( Pl/ . . . , p N/ a; R) = Pr( Pl/ . . . , p N \a)Pr{R\p v . . . , p N ) (2) 

Since expression values are sampled from the Dirichlet 
distribution, 

N N M 

mpi pni«) = n p (p.i«) = n^rK" 1 ^ 

Where 



i=i 



i=l k=\ 



C(a) = (4) 
and similar to (1) above, 



N Pi M 

pr^ip 1/ ... / p N )=nriE p ^ 

i=i j=i k=i 



(5) 



Rozov et al. BMC Bioinformotics 2012, 13(Suppl 6):S2 
http://www.biomedcentral.com/1471-2105/13/S6/S2 



Page 3 of 7 



This leads to 



where 



N M 



N Pi M 

i=i j=i k=i 



t(pi p N .<»;«)- n c (°)ri p ft" 1 

_i=l fe=l 

Taking the log, we get 

M N 

log[L(p, p N ,*;R)] = A%C(«) + 1)E 



(6) 



fe=l 
M 



(7) 



Multi-Genome Multi-Read (MGMR) algorithm 

We wish to estimate a and Pi,...,p n to maximize equa- 
tion (7) above. For this purpose, we adopt an alternating 
iterative procedure of estimating a given the current 
estimate of Pi,...,p n and vice-versa until the total change 
in a becomes sufficiently small (or until a pre-set num- 
ber of iterations have been executed). 

Although for EM-based estimation methods convexity 
guarantees an optimal solution will be obtained, here (as 
shall be seen below) we have no such guarantee. Thus, we 
confine updates to be local by performing only one update 
for P and one for a. By one MGMR iteration, we refer to 
one EM-based P update followed by one a update. 
Estimating P given a 

If we assume a is given, we can write the EM steps to 
find Pi,...,p N : 

E step Letting Match signify a matching between 
reads and regions, and Match(j) be the region from 
which read j originates, we get: 



log[L{T>, a; R, Match)] = NlogC{a) + E ~ ^ E lo 8 Pik 

k=l i=l 

N pi 

+ E E ( l °8( P iMatch(jrtiMatch(j))) 
i=l j=l 



(8) 



which leads to 
Q(P, «|PM, «W) = E Mab:mpll)Mt) [log{L)] (9) 



N Pi M 



+ EEE(^ + 'w#)*7i^ 



(10) 



j=i j=i k=i 



V M P {t) d~u 

2^k=i 1 ik ^ 



M N N Pi M 

= N%C(aW ) + J2 " 1 ) E W* + E E E o*fr&* 

fe=l i=l i=l j=l fe=l 

N Pi M 

+ EEE 

i=i j=i fe=i 



(ii) 



(12) 



M step Given that each q t j k is fixed, the above reduces 
to maximizing 



M N 



NtogC(aW) + £ (a W _ i) ^ fo^ 

fe=i i=i 

N pi M 

+ E EE ^ * lo s Pik 

i=l j=l k=l 



(13) 



Using Lagrange multipliers and differentiating, we see 
that this is maximized with 



P 



,( t+ i) _ a fe°- 1 + Ej^fe 



(14) 



Estimating a given P 

Given a new estimate for we can use a fixed point 
iteration [12] to get a new estimate of a 



A4 



fe fe 
N 



fe=i 

By using 



i=i 



the known 



(15) 



bound 



dlogV (x) 

F(x) > r(x)exp((x-x)^(x)) (having^(x) = — > 



we can get a lower bound on F(a): 



(16) 



+ Y^aklogP^ +Const[P®) 



where logP k = — * Ei 

We maximize tnis bound with a fixed point iteration 
similar to EM, noting that for fixed values of P conver- 
gence is guaranteed, and that for the Dirichlet distribu- 
tion the maximum is the only stationary point [12]. This 
leads to the update 



(17) 



Heuristics/lmplemen tation 

As we have found F(a) presented in equation (15) is 
non-convex even in 2 dimensions (Figure 1), we confine 
updates to be local by allowing only one update for 
both the a and P estimation steps at each MGMR 
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iteration. For genes with EM expression estimates equal- 
ing zero in all samples we substitute 10" 20 for their 
values in MGMR to avoid taking the log of zero. For P 
updates (e.g., equation 14), we avoid potentially negative 
P values by adding one to each alpha (thus ignoring -1 
in the numerator and denominator). We implemented 
the algorithm in MATLAB, where the inputs required 
are read-gene map files for each sample as in SEQEM 
[7], and an initial P estimate matrix. Alphas are initia- 
lized as an M-length vector of ones. 

Results 

Experimental setup 

As in [7,9,10], we examined MGMR's accuracy by com- 
paring its estimates of known expression levels with 
those of existing methods, namely SEQEM [7] and RSEM 
[9,10]. The initial "known" expression levels were esti- 
mates obtained from RNA-Seq samples; how these esti- 
mates were obtained is described below. In our case, we 
had to simulate a population sharing similar expression 
levels and the same set of gene regions. Our experiment 
differed in that we sought to use additional information 
to improve on the estimates of these existing methods. 
These methods were designed to estimate expression of 
single samples, and each had specific advantages which 
we disregarded in our comparison. For example, we 
ignored both SEQEM's ability to utilize SNP information 
and RSEM's ability to allow estimation on assembled 
transcripts by using only reference sequences. 

Simulating data 

To derive artificial reads, we first estimated expression 
on real biological samples using one method and then 



used the resulting distribution of expression values to 
generate simulated datasets for testing. Real samples 
were drawn from lymphoblastoid cells of the Yoruba in 
Ibadan (YRI) population [2,3]. As MGMR requires initial 
expression estimates, the estimate derived from the 
method it was being compared with in each case was 
input to MGMR. Thus, the four initial estimates used 
were from SEQEM, MGMR(SEQEM) (namely, MGMR 
initialized by SEQEM's result), RSEM, and MGMR 
(RSEM) (namely, MGMR initialized by RSEM esti- 
mates). We denote these four estimates SEQEM-A, 
SEQEM-B, MGMR-A and MGMR-B, respectively. We 
simulated reads by randomly selecting start sites falling 
within gene boundaries and extracting sequences from 
those positions. Read lengths were defined for each 
experiment, and simulations were repeated multiple 
times to account for randomness in the sampling 
process. 

To derive the sequence set for the SEQEM compari- 
son, we expanded upon the procedure used in [7]. 
There, SEQEM was shown to improve estimation of 
paralogous gene expression on a set of exon sequences 
from 51 Homo Sapiens chromosome 1 paralogs from 
the HomoloGene [13] database. We extracted a larger 
set of sequences containing all HomoloGene paralogs in 
Homo Sapiens having at least one exon longer than 
twice the read length used that do not overlap in geno- 
mic coordinates. We required this minimal length 
because sequences were sampled from exons, and we 
needed to ensure enough positions existed for full 
length reads to be sampled from these exons. 285 such 
genes remained (for reads of length 35 bp), and these 
were the genes on which expression was tested and 
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from which read sequences were derived. The SEQEM- 
A and SEQEM-B read sets were generated based on 
randomly selected exons from these genes and the 
expression levels from the SEQEM-A and SEQEM-B 
estimates taken on 20 YRI individuals. The read length 
of 35 bp corresponded to that of the YRI samples, and a 
coverage level of 20 was chosen, as this was the level at 
which SEQEM was shown to perform best in [7]. We 
performed a total of 30 repetitions of read simulations, 
where each repetition consisted of 20 samples (corre- 
sponding to the original 20 YRI samples used). 

For the RSEM-A and RSEM-B read sets, the transcript 
set used was also obtained by filtering the HomoloGene 
database to avoid gene overlaps, but no length filtering 
was required: reads were now sampled directly from 
transcripts which all had effective lengths greater than 
the read length used. 524 transcripts corresponding to 
265 genes survived this filtering. For these read sets, we 
produced 30 repetitions of 74 samples, where each con- 
sisted of 100 bp reads at a coverage level of 20. In all 
other respects the sampling process and read generation 
steps were identical to those performed for the SEQEM- 
A and SEQEM-B read sets. 

Error measures 

Accuracy was assessed by three error measures, the first 
two of which were applied in [7]: error rate, computed 



as 



Qi 



difference, computed as 



and Kullback-Leibler divergence, 



computed as Pflog^-. Here P is the estimated dis- 

1 Qi 

tribution generated by the tested algorithm and Q is the 
true distribution. Error measures were averaged over all 
repetitions per sample, and then over all samples. 



Simulated data with priors based on real estimates - 
estimating paralogous gene expression 

To test performance on paralogous gene estimates, we 
set out to compare independent sample SEQEM esti- 
mates with MGMR's common population estimates. 
Before applying SEQEM, we looked to address one criti- 
cism of it from [11], where it was suggested that 
SEQEM's estimation could be improved by incorpora- 
tion of transcript length correction. Upon examination, 
we found the effect of this correction was an increase in 
accuracy, and thus we maintained it for subsequent 
tests. This improvement is documented in the appendix. 

With this correction in place, we estimated expression 
levels on the SEQEM-A and SEQEM-B read sets, apply- 
ing SEQEM and MGMR to each. Outputs were 
recorded at 1-10, 20, 30, 40, 50 and 100 iterations for 
MGMR and at 100 iterations for SEQEM. The results 
are shown in Figure 2. We observed that both error and 
variance levels dropped sharply within just a few itera- 
tions for MGMR, and converged to significantly better 
estimates on average than SEQEM. These trends were 
consistent across all three error measures [Table 1]. 
Variance seemed to diminish more with MGMR over 
time, as might be expected for a method that shares 
information across samples. Notably, MGMR 



-MGMR on SEQEM-A 
SEQEM on SEQEM-A 
MGMR on SEQEM-B 
SEQEM on SEQEM-B 




log 1Q { MGMR Iterations) 

Figure 2 Relative error measured on SEQEM-A and SEQEM-B data sets. MGMR outputs on SEQEM-A and SEQEM-B initializations were 
compared with SEQEM up to 100 iterations. MGMR outputs were recorded at 1-10, 20, 30, 40, 50 and 100 iterations. The first few iterations have 
been trimmed to allow a compact presentation. 
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Table 1 MGMR vs. SEQEM error at 100 iterations on 
SEQEM-A and SEQEM-B data sets 





SEQEM-A sampling 




SEQEM-B sampling 


SEQEM MGMR 


SEQEM 


MGMR 


Error SD Error SD 


Error 


SD 


Error SD 


E 


1.27 1 * 10~ 2 1.03 0.14 


1.50 


0.70 


0.82 6 MO" 3 


I 2 


0.66 2 MO" 3 0.22 4 MO" 3 


0.69 


0.05 


0.27 1 M0" 4 


KL 


0.29 7 MO" 4 0.14 1 * 10" 4 


0.18 


2 * 10" 4 


0.17 1 * 10" 4 



These data sets were derived from SEQEM and MGMR(SEQEM) estimates, 
respectively, on 20 YRI samples. (E: relative error rate; % 2 : Chi-squared error; 
KL: Kullback-Liebler divergence; SD: standard deviation) 



outperformed SEQEM on estimates for samples based 
on SEQEM-A, where sample estimates were obtained 
independently (and thus we expect the variation inher- 
ent in the real samples to be maintained). 

Simulated data with priors based on real samples - 
estimating transcript level expression 

We also sought to examine whether MGMR can 
improve results in the more challenging setting of esti- 
mating transcript level expression. Here, we expect esti- 
mates to be noisier due to low expression values in the 
real samples, and we must contend with multiread map- 
pings due to paralogous genes as well as to isoforms of 
particular genes sharing subsequences as a result of 
alternative splicing. In anticipation of this challenge, we 
used a larger set consisting of 74 sample of single-end 
YRI samples as the real data source and simulated 100 
bp reads instead of 35 bp. This was expected to be a dif- 
ficult case for estimation, as all genes in the set are 
paralogs and many have multiple isoforms, as described 
in the section "Simulating data." 

Once expression estimation was performed on the YRI 
samples and read sets RSEM-A and RSEM-B were gen- 
erated, we again performed expression estimates with 
RSEM and MGMR on each set. In this case, unfortu- 
nately, we found the results did not exhibit a consistent 
trend as before and overall appeared inconclusive. These 
results are summarized in Table 2. It remains to be seen 
why the error results differ according to the level of esti- 
mation (gene vs. transcript) performed. 



Table 2 MGMR vs. RSEM error at 100 iterations on RSEM- 
A and RSEM-B data sets 







RSEM-A Sampling 






RSEM-B Sampling 


RSEM MGMR 


RSEM MGMR 


Error 


SD Error 


SD 


Error 


SD Error SD 


E 


0.1 


1 MO" 3 0.69 1 


* 10" 3 


1.0 


1 * 10" 4 0.61 1 * 10" 3 


x 2 


0.02 


6 M0" 4 1.25 


0.01 


0.02 


9 M0" 4 0.58 3 M0" 4 


KL 


1.5 


0.22 0.6 1 


MO" 3 


0.8 


0.11 0.38 6 M0" 4 



E: relative error rate; % 2 : Chi-squared error; KL: Kullback-Liebler divergence; SD: 
standard deviation 



Table 3 Proportion of genes for which MGMR improves 
estimates on different data sets 





SEQEM-A 


SEQEM-B 


RSEM-A 


RSEM-B 


Proportion 


104/285 


78/285 


126/524 


1 73/524 


% 


36.5 


27.3 


24.0 


33.0 



Proportions of regions (genes for SEQEM and transcripts for RSEM, 
respectively) for which MGMR has lower relative error on average than each 
method compared to. 



Conclusion 

As shown by the 1000 Genomes and HapMap projects, 
one of the drives of modern genetics and bioinformatics 
research is to characterize variation in populations. 
Because of cost and time constraints, such projects have 
only recently become feasible. In addition to such stu- 
dies assessing genomic variation and its relation to dis- 
ease phenotypes based on DNA, it is anticipated that 
RNA-Seq population studies will also grow in popularity 
to more directly assign functional significance to variant 
loci by means of transcription measures. Thus, it 
becomes essential to accurately measure the expression 
levels from each individual to characterize such varia- 
tion. Here, we have shown that for one common study 
design an unexpected benefit can arise. When indivi- 
duals in these studies are drawn from the same popula- 
tion, the estimates made on each can be made more 
accurate because of the commonalities among popula- 
tion members. 

A shortcoming of the MGMR approach is that since it 
assumes commonality among the samples, outlier sam- 
ples will be attracted towards the common denominator, 
and thus appear more similar to the group profile than 
they really are. In particular, if the data are subject to 
differential expression analysis, MGMR may reduce the 
number of differentially expressed genes. 

We have investigated the efficacy of MGMR in tack- 
ling two typical experimental settings - inferring 
expression levels of paralogs at the gene level, and of 
isoforms (also drawn from a difficult set of paralogs). 
Although substantial gains were obtained in the first, 
more inquiry is required to demonstrate a benefit in 
the latter. It is worth noting that in each case at least 
a quarter of the regions considered showed improve- 
ment, as shown in Table 3. With these results, we sub- 
mit a proof of concept that population structure can 
aid in estimation of expression levels for RNA-Seq 
samples. 

List of abbreviations 

E: relative error rate;^ 2 : Chi-squared error; KL: Kullback-Liebler divergence; SD: 
standard deviation; bp: base pair 
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